Building Multitenant SaaS with Claude

SaaS
LLMs
NextJS
tRPC
Vercel
NeonDB
Drizzle
Web Dev
Enterprise
Author

Chandler Staggs

Published

February 8, 2026

Intro

I’ve spent the last 5 months working on a YMS (Yard Management System) for a company that’s been in this business for a while on the analog side, in the meat-packing world. It’s really my first foray into building a product from the ground up. Well, rebuilding, really: they already have an older application that has served them well, but it’s time for an upgrade, and maybe a dream of turning this into a service we can sell to other companies, warehouses, factories, and distributors that need to manage trailers at their yards. In addition to replacing their old system for managing yards, we’re integrating new RFID technology to more precisely track trailers around the yards.

I was brought onboard to help a friend provide this whole technology refresh for the company. For years they’ve had an application written in complicated Java Spring, hosted on Amazon, and managed almost entirely by 3rd-party contractors. By bringing us on, they can bring it all in-house and manage it better. We can deliver faster iterations, faster feedback, better communication, and hopefully better business acumen. All this on top of trying to integrate an RFID solution that provides better tracking and realtime visibility of trailers for the company and their clients.

All of this is happening while we’re in this new LLM Age. If you’ve been using AI then you might agree that they’re seemingly quite incredible: trained on the vast corpus of all humanity’s outputs, these machines have successfully vacuumed up most of the intelligent patterns that we’ve used to build and run our society. Unfortunately they’re not discerning machines with incredible judgement; they mostly provide the median of all of humanity’s works. They won’t, unprompted, build that novel abstraction that could be a game changer, and they can’t help you design novel UI/UX that can truly differentiate the business. For myself, it remains to be seen if they can even build a meaningful application without major errors and bad architectures. I’ve been hearing about large agent orchestration systems like Gas Town and other agentic orchestration loops that can take some idea and iterate on it, but I’m certain the devil is in the details, and people mostly don’t like details, I’ve come to find. They just want the thing to work. But a system is built up from smaller systems: every flaw in a subsystem gets potentially amplified, and every flaw in the details becomes a permanent flaw in the behavior of the application, adding friction for the users. Worst of all, flaws become hard to undo once they’re in place; without system knowledge, undoing them will break other things you’re not aware of, and the true fatality is that without constant oversight you lose knowledge of the system and may not be able to make the precise fix.

I’ve seen firsthand how, as a system goes from brand-new and greenfield to even moderately complex, whole features start intertwining at their interfaces and in the core models. Thankfully we have type systems that at least help us: when we update a model in one place, the compiler tells us if we’re using it wrong elsewhere. LLMs have made it amazing to reflect a model change across most of the system, but I still find they miss stuff. If you add a new attribute to your main domain model, then all REST endpoint schemas need to reflect that change, all of your DB schemas need to reflect that change, the DB might need new constraints and foreign keys, your CSV export/import system needs that change, your email service needs the change reflected, and of course there’s all the frontend UI where this new attribute might need to appear. Core business logic is hard-won; the exceptions to the rules matter. Someone can just say “just do this,” but actually the business domain is “do this, but if this happens then do that, or if this happens instead do this,” and when you execute each one of those ifs it reveals another tree of ifs that might need to be conditionally executed, and this recurses all the way down until the bottom conditions are hit. I guess this is what people call complexity; of course there have been plenty of arguments over accidental and essential complexity. [0][1]

Of course this knowledge doesn’t come easy; it comes from discussions with the stakeholders: the CEO, HR, the IT support teams, the managers and supervisors, the bottom-most employee. Without this knowledge of essential complexity, the system either won’t work correctly or, worse, won’t even provide value to the customer. You have to protect that essential complexity; it’s what gives your system value. You can do this through requirements gathering, keeping documentation up to date, having proper tests, testing often, and carefully reviewing test changes and additions. We also have the benefit (and maybe the shackles) of an existing system that we are seeking to replace as a source of potential truth. This can also be a detriment: it’s a known problem that existing systems sometimes settle into suboptimal solutions, and when a new solution comes along to replace one, there can be resistance from stakeholders who are unwilling to change their process to a new system. These are certainly considerations to keep in mind while designing this.

Design

When starting a project like this, where you’re trying to model a real-world process for accurate tracking, it’s obviously best to spend some time and put some thought into what you’re doing upfront; this pays dividends later, versus rushing into it. I grabbed a whiteboard and immediately started modeling the states and transition events into a state-machine-like diagram, discussing the normal process with the coworkers present who were most familiar with how the business operated. From here we could spend a day or two modeling the business process and trying to capture important business details and unintuitive exceptions; the more nuanced details seep in over time and aren’t revealed immediately.
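That whiteboard diagram translates almost directly into a transition table in code. A minimal sketch (the state and event names here are illustrative simplifications of the real yard process, not our actual model):

```typescript
// Trailer lifecycle as a transition table: current state + event -> next state.
type TrailerState = "at_gate" | "parked" | "at_dock" | "departed";
type YardEvent = "check_in" | "move_to_dock" | "move_to_spot" | "check_out";

const transitions: Record<TrailerState, Partial<Record<YardEvent, TrailerState>>> = {
  at_gate: { check_in: "parked" },
  parked: { move_to_dock: "at_dock", check_out: "departed" },
  at_dock: { move_to_spot: "parked", check_out: "departed" },
  departed: {}, // terminal: no further events are legal
};

// Apply an event, rejecting transitions the diagram doesn't allow.
function next(state: TrailerState, event: YardEvent): TrailerState {
  const target = transitions[state][event];
  if (!target) throw new Error(`illegal transition: ${event} from ${state}`);
  return target;
}
```

The payoff of the table form is that illegal transitions fail loudly instead of silently corrupting a trailer’s status.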

When modeling processes like this you usually look to something like Domain-Driven Design [2]: we translate analog roles into digital ones. Before, in meat space, a “client” would request that trailer X be sent to some dock Y or some other yard/location, a “supervisor” would take that call from the “client” and record it in a log, and then dispatch a “driver” to complete the trailer move request. All of these analog roles become digital ones in our system. These roles encode the capabilities and least-privilege permissions of the users holding them. This is known as Role-Based Access Control (RBAC). It contributes more complexity to the software, especially when you take it a step further with Attribute-Based Access Control (ABAC), where you perform finer-grained checks: given a user, an action, and a resource, a set of policy conditions must be evaluated (e.g. a “driver” can only change the status of their own move, not someone else’s).
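The layering of RBAC and ABAC can be sketched in a few lines. This is a hand-rolled illustration (the type and function names are made up for this post, not our real code): the role table answers “can this role ever do this?”, and the policy function then checks attributes of the user against the resource.

```typescript
// Coarse role layer (RBAC) plus a fine-grained attribute policy (ABAC).
type Role = "client" | "supervisor" | "driver";

interface User { id: string; role: Role; }
interface Move { id: string; assignedDriverId: string; status: string; }

// RBAC: which roles may update a move's status at all.
const canUpdateMoveStatus: Record<Role, boolean> = {
  client: false,
  supervisor: true,
  driver: true,
};

// ABAC: a driver may only touch moves assigned to them.
function mayUpdateStatus(user: User, move: Move): boolean {
  if (!canUpdateMoveStatus[user.role]) return false;
  if (user.role === "driver") return move.assignedDriverId === user.id;
  return true; // supervisors may update any move
}
```

In practice this check runs in the API layer before any mutation, so the UI and the backend agree on who can do what.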

We also knew we wanted real-time updates and interactivity throughout the app: if something happens, we want a notification to pop up so we can alert people to critical information.

Also, since we knew we were going to be using RFID tags and readers for realtime asset tracking throughout the yard, we needed to design for this. How was this going to integrate into the system? We can use Tailscale and connect the scanners over a cellular modem via a Raspberry Pi.

Tech Stack

Of course we’re going to be building this on top of the Web, so a modern web framework is the usual go-to. Web frameworks remove complexity by abstracting the HTTP transport logic from the application logic, along with handling things like URL parsing, HTTP status codes, serialization of data to JSON for response bodies, etc. There were options available to us, but sometimes you choose something because it’s familiar and aids development velocity, and my friend was familiar with NextJS [4] and JavaScript. I hadn’t worked with NextJS before, and it had been several years since I had done much work in JavaScript - previously I was working as a backend engineer in Python with FastAPI [5] - so I took it upon myself to learn NextJS as quickly as possible in the first few weeks of the job.

This is where I think LLMs can be immensely helpful: you can ask questions about well-known technologies and get rapid answers, treating the model like a documentation expert to round out your personal understanding, get onboarded rapidly, and build surface-level knowledge; then you can delve into the more technical aspects to better understand some deep corner. It’s truly wonderful to have your own learning assistant and not have to crawl the Web anymore, or dig through high-noise comments on Stack Overflow to find the one high-signal answer. You can just talk to Claude and have an in-depth discussion. It’s truly amazing.

Along with the web framework we also have our persistence layer - how we’re going to store our data - and for this we chose NeonDB [6], another technology I wasn’t familiar with. It offers a PostgreSQL DB with some cool features like branching: you can branch off a parent and get the same schema and data, but with isolated, branch-specific changes. It’s a pretty cool feature, and honestly Postgres is great!

We use tRPC [3] to drive a lot of the main logic between backend and client, and the nice thing about tRPC is that you get end-to-end type safety! Types are honestly amazing, and I love the design process and guarantees you get with a language like TypeScript. While doing Python work I had started picking up on its type system, and it was a game changer for how I viewed building with code. tRPC is also built on top of TanStack React Query [7], so you get caching, invalidation, query states, and refetching logic, all under one nice library. It’s not bad!

Zod [8] handles our input/output validation and serialization: you define your input/output schemas and it integrates with tRPC.
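What a schema buys you is a single choke point that turns untrusted input into a typed value or fails loudly. Zod expresses this declaratively; to show the idea without the library, here is a hand-rolled equivalent (the `MoveRequest` shape and field names are invented for illustration):

```typescript
// A "schema" is conceptually just: unknown in, typed value or error out.
interface MoveRequest { trailerId: string; destination: string; priority: number; }

function parseMoveRequest(input: unknown): MoveRequest {
  const obj = input as Record<string, unknown>;
  if (typeof obj?.trailerId !== "string" || obj.trailerId.length === 0)
    throw new Error("trailerId must be a non-empty string");
  if (typeof obj?.destination !== "string")
    throw new Error("destination must be a string");
  // Optional field with a default, like zod's .default(0).
  const priority = typeof obj?.priority === "number" ? obj.priority : 0;
  return { trailerId: obj.trailerId, destination: obj.destination, priority };
}
```

With Zod you declare the same shape once and get the parser, the error messages, and the inferred TypeScript type for free, which is exactly what tRPC consumes for its input validation.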

We use Drizzle [9] for our database ORM. It’s nice because it offers SQL-like syntax, so you’re not too far removed from real SQL. You define your table schemas in a schema.ts file and you’re good. It also generates migrations for you, or you can just push schema changes for fast development.

We also have various integrations: Ably for pub/sub-style events, Sentry for observability, SendGrid for email, and even an AI integration for generating dashboard widgets for analysis.

In the Beginning

When you’re starting out, you’re just trying to build your most fundamental domain models with attributes relevant to the core logic of the app. You might map out your schemas for those objects: what they look like, what varies across each instance, what stays the same, what sorts of operations will be supported, and what relationships exist between them (e.g. many-to-many, many-to-one, one-to-one).
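A first pass at those models might look like the following (the entity and field names are illustrative examples for a yard domain, not our actual schema):

```typescript
// Early domain models, with relationships expressed as foreign-key ids.
interface Yard { id: string; name: string; }
interface Dock { id: string; yardId: string; name: string; } // Yard -> Dock is one-to-many
interface Trailer { id: string; rfidTag: string | null; }    // not every trailer is tagged yet

// A move is the interesting record: it joins a trailer to a destination dock,
// acting as the edge in a many-to-many relationship between trailers and
// docks over time, and it carries the workflow status.
interface TrailerMove {
  id: string;
  trailerId: string;  // FK -> Trailer
  toDockId: string;   // FK -> Dock
  status: "requested" | "in_progress" | "completed" | "canceled";
}
```

Getting these shapes roughly right early matters because, as noted above, every later schema (REST, DB, CSV, email) ends up mirroring them.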

You might start with some example UML models that you map to your Drizzle table schema definitions, and model their foreign key constraints. In the beginning there aren’t many constraints, but as the codebase grows you start tracking events and one-to-many relationships.

You eventually land on basic models and build simple tables so that you have a UI to manage all these models at a glance.

You might build special UIs for what your product is trying to solve: a UI for workflow management, another UI for anything that happens geographically in the real world, so you add GPS and tracking on a map.

You might need to start tracking events and histories of your objects, so you build more tables to track changes. This of course leads to needing reports and dashboards, so you build those out so users can see what happened yesterday, last week, last month. Your dashboard shows your main KPIs.

You want to send emails for these reports, and also digests for important events. So now you need a way to send emails, and you integrate an email service provider (SendGrid).

You want important events pushed in real time: a critical domain model’s status changed (canceled, updated, etc.), a system error occurred, a live event happened that needs to be surfaced immediately. You now have Ably for pub/sub to deliver these things.
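The pub/sub shape itself is simple; Ably’s job is doing it reliably across the network. A minimal in-process sketch of the pattern (channel and event names here are made up, and the real system delivers over Ably, not in memory):

```typescript
// Tiny in-process event bus illustrating the channel/subscriber pattern.
type Handler<T> = (data: T) => void;

class EventBus {
  private channels = new Map<string, Handler<unknown>[]>();

  subscribe<T>(channel: string, handler: Handler<T>): void {
    const list = this.channels.get(channel) ?? [];
    list.push(handler as Handler<unknown>);
    this.channels.set(channel, list);
  }

  publish<T>(channel: string, data: T): void {
    for (const handler of this.channels.get(channel) ?? []) handler(data);
  }
}

// Usage: the UI subscribes to move updates; a mutation publishes on change.
const bus = new EventBus();
```

The important design point (which bit us later, see the serverless discussion below) is that subscribers in a hosted pub/sub service outlive any one server process.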

If you rely on critical third-party systems like we do for RFID, now you have to build monitoring for the health of the system as a whole, to ensure the system is behaving and your critical infrastructure is working. If you rely on on-the-ground sensors, hardware, and RFID scanners, then you have to build a UI for that.

You want a way to onboard customers quickly, so you build out a CSV import/export system - another schema to support - and if you want a smart system, that takes extra effort: we built a cool wizard for importing large datasets that uses AI to identify columns, values, and constraints that can be changed to complete the mapping.
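The first step of any import wizard is matching incoming CSV headers to schema fields. A hedged sketch of that step (the field names are illustrative; our real wizard falls back to an AI call for the headers a normalization pass like this can’t resolve):

```typescript
// Match "Trailer ID" / "trailer_id" / "TrailerID" etc. to schema fields by
// comparing normalized forms; unmatched headers are left for the user (or AI).
const schemaFields = ["trailerId", "yardName", "dockName"]; // illustrative

function normalize(header: string): string {
  return header.toLowerCase().replace(/[^a-z0-9]/g, "");
}

function mapHeaders(csvHeaders: string[]): Record<string, string | null> {
  const byNorm = new Map(schemaFields.map((f) => [normalize(f), f]));
  const mapping: Record<string, string | null> = {};
  for (const h of csvHeaders) mapping[h] = byNorm.get(normalize(h)) ?? null;
  return mapping;
}
```

Covering the easy cases mechanically keeps the AI step (and its cost and latency) reserved for the genuinely ambiguous columns.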

And if your domain is in logistics, then you might need a mobile app too for those on the ground. So we built our web app as a Progressive Web App (PWA), so drivers can install it on their phones and use our website like a mobile app. We serve everything with responsive design and show special pages/UI on mobile devices.

Oh, and guess what?! You’ve got process.env littered all throughout your code, but really you need to abstract that: use your type system to parse your .env and ensure your variables are semantically correct, and have a central way to configure the various parameters of your system at run-time and before the system comes online. So now you need a system config singleton that is shared across the app, centralizes a lot of that dispersed logic, and offers type validation so you can’t set an environment variable wrong (and if you forget one, it comes with sane defaults). It’s not universally correct, but it helps with a number of problems.
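The core of that config singleton is small. A sketch of the idea (the variable names and defaults are examples, not our real config; in our app the schema is declared with Zod rather than hand-rolled):

```typescript
// Parse the environment once at startup: required vars fail fast, optional
// vars get sane defaults, and the rest of the app only sees AppConfig.
interface AppConfig { databaseUrl: string; emailFrom: string; port: number; }

function loadConfig(env: Record<string, string | undefined>): AppConfig {
  const databaseUrl = env.DATABASE_URL;
  if (!databaseUrl) throw new Error("DATABASE_URL is required");

  const port = Number(env.PORT ?? "3000"); // default dev port
  if (!Number.isInteger(port) || port <= 0)
    throw new Error("PORT must be a positive integer");

  return { databaseUrl, emailFrom: env.EMAIL_FROM ?? "noreply@example.com", port };
}
```

Called once as `loadConfig(process.env)` and exported as a singleton, it means a typo'd variable crashes at boot with a clear message instead of surfacing as a mystery failure deep in a request handler.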

CI/CD

You can build anything locally, but how do you get it out onto the wider web? You need to deploy it with a domain name that anyone can go to. You also want to be able to make changes, push those changes, and have them tested and deployed in rapid iterations, so you set up your CI/CD pipeline: set up the test environment, install your dependencies, run the tests, return results. You also want to deploy when code is merged into your dev and main branches; in our case we have nice integrations with Vercel, so that when code is pushed to specific branches it auto-deploys to the corresponding Vercel environments.

Every environment is a separate git branch, and branches are just references to commits. So anytime we push code to our main branch as a new release, we can auto-update those individual client branches to the latest commit and it will auto-deploy for us. This is how we made our SaaS a multitenant application: just deploy a new branch with a new domain for the client.

Onboarding New Customers

You also need an automated process to onboard customers. Customer sign-up can be a lengthy process in this sort of domain because we need to automate ingestion of existing data: current yard assets, current employees, the yards and locations themselves, the docks and spots of those yards, geofence boundaries, and more I probably can’t even think of right now. Billing has to be figured out, and tied to system features that can be configured. Productizing this is still being figured out; I helped set up a demo environment so any customer can get an example of the app and how it’s used.

Going Live and Getting Customer Feedback

This is where the rubber meets the road. Now you have to get customer feedback, listen to the small and big complaints, and jot everything down to help improve the system. Especially if it’s a big system, bringing it all online for the first time reveals a lot of flaws and a lot of aspects that need further polish. Turns out our system health telemetry wasn’t enough: we weren’t getting strong, consistent results, and we simply didn’t have enough insights on the health page. Our UI tables needed a complete refresh so that everything was filterable/sortable, we needed better tracking of events, and we needed better initialization and onboarding of the RFID system and devices. Figure out the edge cases of how these RFID tags will work: do we need different types of tags? How does the mobile view actually work for drivers? Does it need to do special things for supervisors too?

Tons of questions and lots of intense feedback and iteration loops with the product. This takes time and is part of the late stage QA process of refinement.

Conclusion, Mistakes, and Growth

Building a SaaS from the ground up is hard! There are so many pieces to think about, and I didn’t even cover it all here. I learned so much in this project - stuff that was just figured out for me at previous companies.

I made a number of mistakes along the way. A big one came from my lack of understanding of serverless and how Vercel works. This is critical because Vercel doesn’t actually use a long-running server - or at least that’s not exposed to you in your application. It spins up short-lived functions that serve the request and are then kept warm for a while before shutting down. So there’s no “server,” just functions running, and that bit me: I initially designed the event bus system with long-running listeners for the subscribers, but when I pushed this to Vercel it didn’t work as expected - the listeners were being shut down after a while. So it necessitated a 3rd-party solution for delivering events to subscribers, via Ably directly to clients and other webhooks.

Of course this architecture is important because it presents a number of tradeoffs: you don’t have long-lived in-memory data structures, and you have startup times that might actually start to matter for the performance of the system. If performance does become a major issue, this can be managed by setting up an in-memory cache service that keeps data hotloaded.
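For intuition, here’s the shape of such a cache: a TTL-bounded key/value store. This is a deliberately tiny sketch; in a serverless setup the real thing would live in a shared service like Redis, since function instances don’t share memory with each other.

```typescript
// Minimal TTL cache: entries expire after ttlMs and are lazily evicted on read.
class TtlCache<V> {
  private store = new Map<string, { value: V; expiresAt: number }>();
  constructor(private ttlMs: number) {}

  set(key: string, value: V): void {
    this.store.set(key, { value, expiresAt: Date.now() + this.ttlMs });
  }

  get(key: string): V | undefined {
    const entry = this.store.get(key);
    if (!entry) return undefined;
    if (Date.now() > entry.expiresAt) {
      this.store.delete(key); // expired: evict and miss
      return undefined;
    }
    return entry.value;
  }
}
```

The TTL matters more than usual here: since any instance can serve any request, stale-but-bounded reads are the tradeoff you’re explicitly accepting in exchange for skipping a cold database query.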

References

[0] https://dafoster.net/articles/2025/07/22/designing-software-in-the-large/
[1] https://news.ycombinator.com/item?id=13964720
[2] https://www.cosmicpython.com/book/chapter_01_domain_model.html
[3] https://trpc.io
[4] https://nextjs.org
[5] https://fastapi.tiangolo.com
[6] https://neon.tech
[7] https://tanstack.com/query
[8] https://zod.dev
[9] https://orm.drizzle.team
[10] https://news.ycombinator.com/item?id=45151622